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ABSTRACT 

In   this  paper  we   describe   an  envelope-oriented  block  Gaussian 
elimination  method  for  solving  linear  systems   arising   from  the  use   of 

five-point   finite   difference   approximations   on  n  x  n  square   grids.      We 

5/2  7/2 

show  that   the  method  requires  0(n        )    storage  locations    and  0(n      ^) 

arithmetic  operations.      Asymptotically  this   is    a  significant  improvement 
on   the   costs   of  ordinary  band  and  envelope  methods,  but  we   find  that 
for  moderate   values   of  n   (say,   n   <:   50)  ,   the  band  and  envelope   methods 
are   superior. 


1.      Introduction 

Given   a  positive   integer  n,    let   D     be    a  regular  n  x  n   square    grid 

2 
with   an  unknown  x.    associated  with  each  of  the  n     grid  points.      Letting 

2 

N  =   n    ,  we    consider  the   solution  of  an  N  x  N   system  of  linear  equations 

Ax  =  "b,  (1.1) 

where   a.  .   4   0  if  and  only  if  the  nodes    corresponding  to  x.    and  x.    are 
connected  by   a  line   in   the   grid.      Systems   of  this    form   arise   in  the   use 
of  five-point    finite   difference   approximations   in  the  numerical   solution 
of  second-order  elliptic  partial   differential   equations   over  sauare 
regions.      In  order  to  simplify  our   discussion   somewhat,  we  will   assume 
that  A  is   symmetric    and  positive   definite,    although   similar  results  hold 
in  certain  other  cases    (e.g.,  when  A  is  nonsymmetric   and  diagonally 
dominant) . 

A  possible  method  for  solving   (l.l)    is   the   Choleski    decomposition, 

T 
i.e.  ,    first   factoring  A  into  the  product  LL    ,  where  L  is   lower  triangular, 

T 
and  then  solving  Ly  =  b   followed  by  L     x  =  y.      We   consider  here   a  variant 

of  this  method  in  which  we   do  not  explicitly  obtain  L.      Either  algorithm 

may  be   stably   applied  to   the  permuted  system 

(PAPT)(Px)   =    (Pb)  (1.2). 

instead  of  (l.l),  for  any  N  x  N  permutation  matrix  P  [lTl»  and.  this  fact 

allows  us  to  vary  the  ordering  of  the  grid  points  (i.e.,  the  correspondence 

2 
between  the  unknowns  x.  and  the  n  grid  points  of  D  )  in  any  way  that  seems 

useful.   In  particular,  we  will  attempt  to  minimize  the  amount  of  storage 

or  work  required  to  solve  the  resulting  linear  system  (1.2).   These  two 

goals  are  related,  but  it  is  usually  not  possible  to  attain  both 

simultaneously. 


If  L   is   to  be   computed  explicitly,   then  there   are   several   different 
algorithms  which  may  he   used  to   factor  A.      Primarily,   these   differ  in  the 
degree   to  which  they   take   advantage  of  the  known   zero   structures   of  A  and 
L.      Asymptotically,   the  most   efficient   in  terms   of  both  storage   and  wor> 
are   the  general   sparse  matrix  methods,  which   store   and  operate  only  on  the 
nonzero  entries   of  A  and  L.      With   an   appropriate  ordering  of  the   grid  points 
(e.g.,   the  nested  dissection   ordering    [l]),   these   require  only  0(n     log  n) 
storage   locations   and  0(n    )    arithmetic  operations    [1,8].      However,    the 
overhead,    i.e.,   the  number  of  extra  storage   locations    and  nonnumeric 
operations    required  to   avoid  storing  and  operating  on   zeroes   in   A  and  L, 
may  be   quite  high   for  general  sparse  methods.      Furthermore,    they  may  not 
perform  very  well   in  paged  storage  environments,   since   they  tend  to   access 
matrix  entries  which  are  physically  stored  far   apart.      General   sparse 
matrix  methods  have  been  primarily  used  for  large  problems  ,■  although  it 
has  been   found  that  they  may  offer  savings   even  when  n  is   small    [9l- 

In   contrast,  we   can  use   standard  band   [ll]   or  envelope    [T»l^] 
algorithms.      These   are  easy   to   implement  and  use    and  have  low  overhead 
requirements,   but  they   require  0(n    )    storage  locations    and  0(n    )    arithmetic 
operations   asymptotically   [3].      Hence   these  methods   are  mainly  useful  when 
n  is   so   small   that   the  overhead,   instead  of  the  amount   of  storage  or  number 
of  operations,   is    critical. 

Between  these  two  extremes   is    a  whole   class   of  hybrid  block  methods 
[5,6,13]   which,   though   fairly   complex  to   implement,  offer  a  means  of 
achieving  the   relatively  low  overhead  of  band  and  envelope  methods  without 
their  large   storage   and  work   requirements.      Generally,   these  methods 
divide  A  into   a  number  of  independent  blocks    and  obtain  the   factorization 
of  A  by  operating  on  the  blocks    and/or  their  factorizations.      Storage 


3 
references   tend  to  be   localized  to  the   entries   of  the   individual  blocks,   so 
that  block  methods   seem  to  be    far  better  suited  to  paged  storage   systems 
than  are  the   other  direct  methods.      Moreover,  they  may  prove   useful    for 
solving  systems   so  large   that   auxiliary   storage   is   required  to   store   A  or  L. 

In    [5],  George  describes   a  block  method   for  the  solution  of   (l.l) 
over  a  nine-point  finite  element  grid.      He   divides   A  into  blocks   corresponding 
to   independent   vertical   regions  of  the  grid,  uses    a  band  solver  to    factor 
the  blocks,   and  combines   the   factored  blocks   so    as  to  obtain   an   implicit 
form  of  L.      This  method  might  be   termed  block  envelope  with  band  Choleski 

decomposition  used  in  each  independent  block,   and  George   shows   that  it 

cr  /  p  7  /  p 

requires  0(n  ;    )    storage  locations    and  0(n        )   multiplicative  operations 

(i.e.,   multiplications    and  divisions).      The   use  of  envelope    Choleski, 

instead  of  band  Choleski  ,  would  yield  modest   reductions   in  the  lead 

coefficients,  but  not   the  orders,  of  the  expressions    for  the   storage   and 

work  requirements   of  the  method. 

George's  method  could  be   applied  directly  to  our  model  problem, 
giving  the   same   storage   and  work   requirements    as   in   [5l>  but   there   is   an 
analog  of  his  method  which  is   substantially  better.      Instead  of  vertical 
separators   in   the   grid,    diagonal   separators    are   used   to    divide   the   grid 
into   independent   regions,   and  this    change  leads   to   approximate   asymptotic 
savings   of  h0%   in  storage   and  80%   in  work. 

In   this   paper  we  present  the   details  of  the  diagonal  block  method 
and  evaluate   its   usefulness  through  a  theoretical   and  experimental  analysis. 
Despite  the  apparently  substantial   asymptotic  savings  over  simpler  methods, 
we  have  not   found  that   the  use  of  either  our  method  or  George's   is  partic- 
ularly worthwhile.      Apart    from  some  practical   difficulties   in  programming 
the  methods,   it   appears   that   they  suffer  from  a  common  malady  among  hybrid 


k 

methods   in   that   there   is   no  range  of  problems    for  which  they   are   the  best 
methods   available.      For  n   £   50,  both  methods   require  more  work  than  more 
standard  band  or  envelope  methods.      And  for  larger  values    of  n,    the    actual 
amount  of  work  is    critical,    so   that  the  higher  overhead  with   general   sparse 
methods    does   not  matter  very  much.      It   should  be  noted  here   that   these 
conclusions    are  based  on  the   assumption  that   the  work,   not  the   storage,   is 
of  primary  importance.      If  this   is  not  the    case,  then  the  use  of  either  our 
method  or  George's  may  be   quite   fruitful,   although  the   recently  developed 
minimal  storage   Gaussian  elimination  algorithms  may  be  even  better,   since 
they   require  only  0(n    )    storage  locations    [3,1*+ ]• 

2.      Description   of  thei  Method 

For  some  positive   integer  a   <<  n ,   choose   a  -  1  lines  of  grid  points, 
or  separators,  parallel  to  the  line  y  =   x,    in  such   a  way  that  the   semi- 
perimeter  of  the   grid  is   divided  into   roughly  equal   segments    (see  Figure   2.1), 


Figure   2.1:      n  x  n   grid  divided  by  separators   for  a  =   h. 
These   separators   divide   the  grid  into   a   distinct   regions  which   are   independent 
in  the   sense  that  if  x.    and  x.   lie  in   different  regions,   then   a.  .    =   0  in 
(1.1). 

Two   considerations  lead  to   the   choice   of  diagonal,    rather  than 
vertical,   separators.      First,   the   sizes  of  the   diagonal   separators   decrease 


near  the   corners   of  the   grid,   so  that  the   average   separator  size   is 
approximately  n/2   instead  of  n.      Since   a  large  part   of  the  work  is    associated 
with  the   separators,   this   reduction   in  size  has   a  substantial   effect  on   the 
total  work.      Second,    diagonal   separators   in   a  rectangular  five-point   grid 
lead  to   matrix  blocks  which   are    diagonal   matrices. 

Once  the  grid  has  been   divided  into   independent   regions  by  the 
diagonal   separators,   the   regions    and  separators    are  numbered  in  the  order 
indicated  by  Figure  2.1.      The  nodes  within  each  region   are   sequentiallv 
ordered  along  diagonals  parallel   to   the  line  y  =  -x.      We    call   this   method 
of  numbering  the   nodes  the    anti-diagonal  ordering  scheme,    and  it   is    an 
analog  of  the   so-called  diagonal  ordering  scheme  which   is   known  to  be   quite 
efficient    for  rectangular  five-point  grids    [2,7].      The  nodes   in   each   separator 
are   ordered  sequentially   from  the   lower  left   to   the   upper  right   in  the    grid 
(see   Figure   2.2). 
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Figure   2.2:      Grid  ordering  for  n  =   9,    a  =   k. 


For  the  example   in  Figures   2.1   and  2.2,   the  matrix  A  has   the 
following  block   form: 


h 


J 


A  = 


B. 


B, 


b; 


(2.1) 


B, 


Factoring  A  into  the  product  LL  we  obtain: 


L  = 


-T 

ciLi 


-T 
B2L2 


-T 
C2L2 


-T 

BuLi, 


-T 


where    (following  George    [5]) 


A.    =   L.    LT 
1  11 

A3  =   A3-B2  a;1  B^   A"1  C* 


T 

3  ~3 


L.  L 


A.    =   A.-B. 


-1        T 
A,    .    B_:      -F. 


f:  „-c. 


-1 


(2.2) 


-T 

B6L6 


i=l,2,k96      (2.3a) 
(2.3b) 


1-1  Ai-1  Bi-rFi-2   Fi-2"Ci-3  Ai-3  Ci-3  =   Li   Li'      i=5'T 


ft  =  l:1  Bi.  1  a:\  ct  , , 
i      1     1-1  1-1   i-i' 


(2.3c) 
i=3,5  (2.3d) 

For  a   >    U,   the  pattern  of  the   last  two   rows    and  columns   of  A  and  L  would 
repeat. 

Again   following  George    [5],   L  is   not  obtained  explicitly   and  only 
the  B's,   C's,   L's,   and  F's    are   stored.      Given   these  matrices,   the   solution 
of   (1.1)   with  A  given  by    (2.1)   may  be  obtained  as   follows    (it  is    assumed 
that  b   is   divided  compatibly  with  the  blocks   of  A): 


(i)  Solve   L.    y.    =  b. ,    for  i  =   1,2,U,6. 

(ii)        Set  b     =  "b3-B2  L~     Yg-C^  L~     y^   and  solve   L^  y^  =  b^ 

(iii}      Set  \  =  bi-Bi-l   Lx-1  yi-l-Fi-2  yi-2-Ci-3  LI-3  yi-3'    and  S°1Ve 

L.   y.    =  b. ,   for  i=5,7. 
11  1 

T 

(iv)        Solve  L     x     =  y„. 

_1     T  T  * 

(v)  (a)      Set  yg  =  Yg-Lg     Bg  x    ,    and  solve   Lg  x^  =  y^. 

ip  rp 

(b)      Set  y     =  ycr-Fc  xtj    and  solve  L     x     =  y    . 

-1     T  -1      T  T 

(vi)         (a)      Set  y^   =  y^-L^      C^   x^-L^      B^   x    ,    and  solve  L^   x^  =  yy 

^  m  rp  ^ 

(b)      Set  y     =  yo-Fo  xc  >   and  solve   L     x     =  y    . 

—1      T  -1     T  T 

(vii)      (a)      Set  y2  =  y2"L2     C2  x  -Lg     B2  x^   and  solve   Lg  x2  =  Vg. 

"-IT  T  A 

(b)      Set  y1  =  y1-L1      (^   x   ,    and  solve  1^  xx  =  y1 . 

Within  each  numbered  group  of  computations,   the   lettered  computations  may  be 

performed  in  any  order  or  simultaneously.      The   solution  scheme   extends   to 

larger  values   of  a  in  the  obvious  way. 


3.      Storage   Requirements 

In   this   section  we   analyze   the   storage   requirements  of  the  method 
described  in  Section   2.      Throughout   the   calculations  we   assume  that    a  <<  n. 

The  independent  A.  '  s    are   quite   similar  to  matrices   derived  from  the 

use   of  a   five-point  operator  on  long,    thin,   rectangular  domains.      For  such 

matrices   it   is  well  known    [T,10]    that   envelope  methods    are  more   efficient 

than  band  methods.      Hence  the   independent  A.'s    and  the   corresponding  L.'s 

c  l  i 

are   stored  and  operated  on   as   envelope  matrices ,   so   that  the   storage   reauired 

is   given  by 

Sp(a)    =    (--  ~r)n3  +  0(n2).  (3.1) 

B  a        3a2 


The  A.  '  s    corresponding  to  the   diagonal   separators   are   diagonal 

matrices,  hut   the   corresponding  A.'s    and  L.'s   tend  to  be  quite   dense.      Hence 

11 

their  entire   lower  triangles    are   stored,   and  the   storage   required  is   given  by 

S5(o)   =    (|+  ^-)n2  +  o(na).  (3.2) 

The  B.'s    and  C.'s    are  quite   sparse,   so   a  standard  sparse  matrix 
storage   scheme   is   used  for  them.      The   storage   required  is   given  by 

SBC(a)   =   2a  n  +  0(a).  (3.3) 

Finally,   the   F.'s   are   generally   full.      Therefore,   the   storage 
required  for  them  is   given  by 

Sjo)   =    (§  +  ~)r?  +  0(no).  (3.M 

F  3       3a 

Summing   (3.l)-(3.M»   the   total   storage   required  by  the  method  is 

ST(a)    =    (a   "   "%)n3  +    (f)n2  +   °(n2)'  (3-5) 

3a 

This   estimate   is    approximately  minimized   for  a    =   a     =    /2n,    and  for  n 

sufficiently   large, 

ST(aQ)   =    v^n5/2  +   0(n2).  (3.6) 

By  way  of  comparison,  George    [5]   obtains    a  minimal   storage  estimate  of 

v6  n  +   0(n    )  ,   so   that   there   is    a  significant  saving  in  using  diagonal 

separators   for  the   five-point  operator.      For  the   standard  band  and  envelope 

13  2 

methods,  the   storage   requirements    are    at  least  —  n     +   0(n    )    [3],   so  again 

there   is    a  significant   saving.      Although   (3.6)    is    an   asymptotic   result, 

Table    3.1   shows    that  there   are   large   savings   in   storage   even   for  small 

values   of  n.      In  fact,   as  we   shall   see,  the   storage   requirement   for  the 

method  may  be   its   primary  virtue. 


n 

ao 

ST(aQ) 

1    3 
2n 

8 

2 

267 

256 

16 

h 

1530 

20U8 

2k 

6 

1+177 

6Q12 

32 

8 

8592 

1638U 

Uo 

8 

1U82U 

32000 

U8 

8 

23392 

552Q6 

Table    3.1 
Experimentally  Determined  Optimal  Storage  Requirements 


h.     Work  Requirements 

We  now  estimate   the  number  of  multiplicative  operations   required 
to  obtain   the   implicit  block   form  of  L.      Again  we   assume  that   a   <<  n.      The 
work   can  be  broken  into   four  sections  which   are   treated  separately: 

(i)  decomposition  of  the   independent   A.'s; 

(ii)        computation  of  the  A.'s; 

(iii)      decomposition  of  the   A.'s; 

(iv)        computation  of  the   F.'s. 

The   decompositions  of  the  A.'s    corresponding  to   the  independent 
grid  regions    are  performed  as   envelope   Choleski    decompositions    [7,10]. 
Results   of  George    [7]    can  be   used  to   compute   the  total    amount  of  work   for 
the  independent  matrices.      This   is   given  by 

WB(a)    =  -^nU   +  0(^-).  (U.1) 

2a 

The   computation  of  the  A.'s   is   very  sensitive  to  the  order  in 

which  we  perform  the  matrix  operations.      Following  George    [5],   we  obtain 


10 


-1        T 
terms    of  the    form  B.        =  B.    ,    A.    ,    B,        hv   first   computing 

i-1  l-l      l-l      l-l 

b.  .  =  l:\   lT1.,   BV  _  (U.2) 

l-l  l-l      l-l      l-l 


and  then   computing 


Bi-i  "  Bi-i  Bi-i  •  c*-3' 


The   computation   of  B  can  be   done    as   two   envelope  backsolves  ,   and  we   can 

take   advantage  of  the   sparseness  of  B.    .    to   compute  B.    ,    from  B.    ,    in   a 

l-l  l-l  l-l 

small  number  of  additional  multiplications.      In  this  way  all  of  the   A.'s 
can  be   computed  in   a  total   of 

Ws    (a)   =    (rf  +  ^)nk   +   (f  +  |)n3  ♦  O(^)  (k.k) 

1  3a 

mult i pi i cations . 

Once  the  A.'s  have  been  obtained,  they  can  be  decomposed  in  a  total 

of 

ws2(a)  =  (517+6)n3  +  0(r»  (t-5) 

additional  multiplications,    assuming  that  the  A.'s    are   generally   full,   so 
that   sparse   decomposition  methods   do  not  help  much. 

Using   ideas   similar  to  those  of   (U.2)    and   (U.3),   the   F.'s    can  be 
obtained  efficiently  by   first   computing 


and  then   computing 


=i-i "  Lili  LI-i  ci-i  (1,-6> 


FT  =  lT1  c.  ..  (U.T) 

l  l         l-l 


Using  this   technique    and  noting  that  the   terms    C.        also   appear  in  the 

computation  of  the   A.'s,  we    find  that  the  number  of  additional  multiplications 

needed  to   compute  the  F.'s   is   given  by 

1  3 

WF(a)   =    (f  -  |)n3  +  O(jj-).  (U.8) 
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Summing  and  simplifying,   the   total  number  of  multiplicative 
operations   required  for  the   implicit   decomposition  of  A  is    given  "by 

WT(a)    =    (3|+^+(|f+l^+0(4. 

6a 


(U.9) 


W  (a)    is    approximately  minimized  for  a   =    a 


Pn 
5 


,    and 


W_(ol)   =  i  fio  nT/2  +  0(n3)    . 
T     1  3 

For  a  =   a    ,    (3.5)   yields 

ST(a1)    =  ^  v^O  n5/2  +  0(n2), 
and  conversely,    for  the   a     which  minimized  storage,   the  work  is 


3 


7/2 


W    (a    )    =  f  J2nUd  +   0(nJ). 


(U.io) 


(i+.n) 


(U.12) 


T    0' 

So  there   is   little  to   choose  between  minimizing  the  work   and  minimizing 
the   storage. 

Both   (1+.10)    and  (1+.12)    compare   favorably  with  George's  work   estimate 
of  2    /lO  n  +  0(n    )    [5].     However,   all   these   results    are   valid  only  in   the 

asymptotic   case  of  large  n.      For  values   of  n    £   50,  we  have    found  that   the 
minimal   amount   of  work  is    always    at   least  n   /k    (see   Table  l+.l). 
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-    nk/k 

8 

2 

267 

h 

3hfh 

1021+ 

16 

1+ 

1530 

8 

38810 

163BU 

2k 

6 

Ul77 

10 

159^96 

82  Q  U1+ 

32 

8 

8592 

12 

U29605 

2621 1+1+ 

1+0 

8 

1U82U 

12 

928186 

61+nooo 

1+8 

8 

23392 

lU 

17  396 1+2 

1327101+ 

Table   l+.l 
Experimentally  Determined  Optimal  Storage   and  Work  Requirements 
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5.      Conclusions 

In   conclusion  we  make   the   following  remarks: 
(i)  In  terms   of  asymptotic   storage   requirements,   our  method   (like 

George's    [5])   has   a  great  advantage  over  the   standard  band  or  envelope 
storage   schemes.      In  practice,    this  holds    for  even   very  small  values   of 

n    (see  Table   3.1).      However,   the  minimal  storage  methods    [3,1*+]    require 

o 
only  0(n'~)    storage   locations,    and  recent   research    [9,1^,1^]    indicates 

that   general    sparse  methods  may  be   implemented  so   as   to   require   less 

storage   than   our  method,    even   for   small   values  of  n. 

(ii)  Asymptotically,  our  method  is   an   improvement  over  standard  band 

or  envelope  methods   in  terms  of  the  number  of  multiplicative  operations , 

and  it  might  be  hoped  that  it  would  be   competitive  with   general   snarse 

matrix  methods   for  some   range  of  n.      However,    for  moderate  values   of  n 

(say  n   £  50),    for  which  it  might  not  be  worthwhile  to  use   a  general   snarse 

matrix  method,   the  method  appears   to  be  no  better  than   an   envelope  method, 

except,  perhaps,   in  terms  of  the  work   required  for  backsolution.      In   fact, 

Schultz   [15]   has  proposed  a  very  simple  block  envelope  scheme  which 

1  )  "5 

requires  only  77  n     +  0(n    )   multiplicative  operations    for   a  symmetric  n  x  n 
five-point   finite   difference   grid.      Hence   it   appears   that  where   the  number 
of  arithmetic   operations   is    critical,   there   is  little   use   for  our  method. 
(A  similar  qualitative  statement  may  be  made    concerning  George's  block 
method  for  nine-point   finite  element   grids    [5jQ]K 

(iii)        From   (i)    and   (ii)    it  may  be    concluded  that  our  method  is   not 
generally  useful   for  the   direct  solution  of  (l.l).      However,    if   (l.l)    is 
to  be   solved  for   a  number  of  different   right  hand  sides,  the  method  may 
prove   quite  efficient  because   it  substantially   reduces    the   storage   require- 
ments  and  the  number  of  arithmetic  operations  needed  for  backsolution. 


13 
Minimal   storage  methods   are  not  efficient  in  this    case,    since   they  do  not 
save   a  factorization  of  A.      Typical   examples   of  such  effective  use  of  the 
method  might   arise   in  the  solution  of  semi-linear  or  nonlinear  partial 
differential   equations   using  certain   iterative   techniques    for  which 
backsolving  is  the   critical  operation   (see    [h]) . 
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